Word Embedding

Introduction

We discussed TF-IDF as a text encoding technique in the Text Encoding section. While TF-IDF is still used today in some applications, it has several limitations. We start by listing the challenges with TF-IDF and then discuss the solution.

Challenges with TF-IDF

Sparsity

A TF-IDF vector contains mostly zeros, because any single document uses only a small fraction of the vocabulary. Storing and processing all those zeros is a waste of memory and computation, which makes the vectors inefficient to work with.

Curse of dimensionality

A TF-IDF vector is as large as the vocabulary. Imagine you have a vocabulary of 1,000,000 words: your TF-IDF vector will then be 1,000,000 dimensions long. This is not only inefficient to store but also encourages overfitting during training. Suppose you have 1,000 examples in your training data and you want to train a model to classify emails as spam or not spam using TF-IDF and logistic regression. Your model will then have roughly 1,000,000 parameters to fit. That is far too many parameters for so little data, and the model is unlikely to generalize well.
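The two problems above can be made concrete with a quick back-of-envelope sketch (the word indices and weights below are made up for illustration):

```python
# Back-of-envelope: a TF-IDF vector over a 1,000,000-word vocabulary,
# represented sparsely as {word_index: weight}. A short email touches
# only a handful of dimensions; the rest are implicit zeros.
vocab_size = 1_000_000
tfidf_sparse = {12: 0.31, 4_875: 0.12, 99_123: 0.58}  # toy weights

nonzero = len(tfidf_sparse)
density = nonzero / vocab_size
print(f"non-zero entries: {nonzero} of {vocab_size} ({density:.6%})")

# A logistic regression classifier over the dense vector needs one
# weight per dimension (plus a bias) -- far more parameters than a
# 1,000-example training set can reliably constrain.
num_params = vocab_size + 1
print(f"logistic regression parameters: {num_params:,}")
```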

Euclidean Distance issue

Semantically similar words (like man and woman) are no closer in the vector space than dissimilar words (like man and nebula). Each word gets its own dimension, so the vectors of any two distinct words are orthogonal and Euclidean distance between them carries no semantic meaning.
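A small sketch with one-hot word vectors (the simplest case of this encoding family) makes the problem visible: every pair of distinct words sits at exactly the same distance.

```python
import math

# Toy one-hot encoding over a 4-word vocabulary. Every pair of
# distinct words is at the same Euclidean distance (sqrt(2)), so
# "man" is no closer to "woman" than it is to "nebula".
vocab = ["man", "woman", "king", "nebula"]
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d_similar = euclidean(one_hot["man"], one_hot["woman"])
d_dissimilar = euclidean(one_hot["man"], one_hot["nebula"])
print(d_similar, d_dissimilar)  # both sqrt(2)
```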

Solution: Word Embedding

Word Embedding models map sparse one-hot vectors (e.g., 1,000,000 dimensions) to lower-dimensional dense vectors (e.g., 300 dimensions) in which semantically similar words are placed close together. This addresses all the TF-IDF problems above: the vectors are dense, low-dimensional, and distances in the space reflect meaning. Word embedding vectors can be learned using different approaches.
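Mechanically, an embedding layer is just a table of dense vectors, one row per vocabulary word; multiplying a one-hot vector by that table selects a row, so in practice the lookup is an index. A minimal sketch (the 4-word vocabulary and random 3-dimensional vectors are toy stand-ins for a learned 1,000,000 x 300 table):

```python
import random

random.seed(0)
vocab = ["man", "woman", "king", "queen"]
embedding_dim = 3  # real models use e.g. 300

# The embedding matrix has one dense row per vocabulary word.
# In a trained model these rows are learned, not random.
embedding_matrix = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
                    for _ in vocab]

def embed(word):
    # Multiplying a one-hot vector by the matrix just picks out a row,
    # so the "matrix product" reduces to an index lookup.
    return embedding_matrix[vocab.index(word)]

vec = embed("queen")
print(vec)  # a dense 3-dimensional vector
```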

Transformers (Deep Network)

As we’ll discuss in the Transformers section, while training a Transformer model, we can learn the word embedding vectors simultaneously with the model parameters. Here we will discuss Word2Vec.

Word2Vec (Shallow Network)

Word2Vec is a shallow network and one of the earliest NLP training methods based on predicting a hidden (masked) word. It comes in two flavors:

  • Continuous Bag of Words (CBOW)
  • Skip-gram

Here we only focus on CBOW.

Training Word2Vec

Getting a Text Dataset

First, we need a dataset of sentences. Normally, we use a large corpus of text such as Wikipedia, books, or news articles, tokenized into words before training. To see how tokenization works, you can refer to the Tokenization section.

Masked Language Modeling

We randomly mask some words in the sentences and predict each masked word from the context of the surrounding words. The context is the words before and after the masked word, and the window size is the parameter that determines the context size. For example, in the image above, the window size is 2, which means the context is the 2 words before and the 2 words after the masked word.
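Extracting (context, target) training pairs with a window size of 2 can be sketched as follows (the sentence and function name are illustrative):

```python
def cbow_pairs(tokens, window=2):
    """Yield (context, target) pairs: the target plays the role of the
    masked word; the context is up to `window` words on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

tokens = "the quick brown fox jumps".split()
for context, target in cbow_pairs(tokens, window=2):
    print(target, "<-", context)
```

Note that words near the start or end of a sentence simply get a shorter context, which is the usual convention.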

Then, we train a shallow neural network to predict the masked word from its context. This is a masked language modeling task.
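A minimal CBOW training step can be sketched in NumPy. This is a toy illustration, not the original Word2Vec implementation (which adds tricks like negative sampling): the context embeddings are averaged, scored against an output matrix, and updated with softmax cross-entropy gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "quick", "brown", "fox", "jumps"]
V, D = len(vocab), 4                 # vocabulary size, embedding dimension
W_in = rng.normal(0, 0.1, (V, D))    # input embeddings (kept after training)
W_out = rng.normal(0, 0.1, (D, V))   # output projection

def train_step(context_ids, target_id, lr=0.1):
    # CBOW: average the context word embeddings ...
    h = W_in[context_ids].mean(axis=0)            # (D,)
    # ... score every vocabulary word, then softmax
    scores = h @ W_out                            # (V,)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    loss = -np.log(probs[target_id])
    # Gradients of softmax cross-entropy, applied in place (SGD)
    d_scores = probs.copy()
    d_scores[target_id] -= 1.0
    d_h = W_out @ d_scores
    W_out[...] -= lr * np.outer(h, d_scores)
    W_in[context_ids] -= lr * d_h / len(context_ids)
    return loss

# Repeatedly predict "brown" from its window-2 context.
ctx = [vocab.index(w) for w in ["the", "quick", "fox", "jumps"]]
tgt = vocab.index("brown")
losses = [train_step(ctx, tgt) for _ in range(50)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

After training, `W_in` is the word embedding matrix; `W_out` is discarded.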

At the end of the training, we get the word embedding vectors for each word in the vocabulary. So, if the vocabulary has 10,000 words, and we use 300-dimensional embeddings, we will get a 10,000 x 300 word embedding matrix.

Characteristics

  • Word embeddings are fixed during inference
  • The vectors themselves do not depend on context at inference time; context is only used during training
  • Uses a shallow neural network to train the model

Note that although word embeddings ignore the context during inference, sequence models like RNNs or Transformers built on top of them can take the context into account during inference.

Semantic Space

When we reduce Word2Vec embeddings from 300 to 3 dimensions we can visualize semantic relationships as shown in the image at the top of the page.

For example, we can see that the relationship between queen and king is the same as the relationship between woman and man, as shown in the image above. In other words, the vector space captures semantic relationships between words: the queen - king vector is approximately the same as the woman - man vector in the embedding space.
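This analogy arithmetic can be illustrated with hand-made toy vectors. The 3-dimensional "embeddings" below are constructed so the analogy holds exactly (axis 0 roughly encodes gender, axis 1 royalty); real Word2Vec vectors are learned, and the relationship holds only approximately.

```python
import math

# Toy hand-constructed "embeddings" for illustration only.
emb = {
    "man":   [ 1.0, 0.0, 1.0],
    "woman": [-1.0, 0.0, 1.0],
    "king":  [ 1.0, 1.0, 1.0],
    "queen": [-1.0, 1.0, 1.0],
}

def add(u, v): return [a + b for a, b in zip(u, v)]
def sub(u, v): return [a - b for a, b in zip(u, v)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# king - man + woman should land nearest to queen.
query = add(sub(emb["king"], emb["man"]), emb["woman"])
best = max(emb, key=lambda w: cosine(emb[w], query))
print(best)  # queen
```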

As another example, we see that if we add the embedding vector of walking to the swam - swimming vector, we get the embedding vector of walked, as shown in the image above.

As a last example, for country capitals, you can see that the same vector takes us from the word embedding of each capital to the word embedding of its country, as shown in the image at the top of the page.